fix(cohorts): optimized select from cohort_people #21564
This is one main reason we moved off of sign and onto version: the collapsing merge tree has shit resiliency. Ideally nothing uses the cohort sign queries anymore. I don't think you need to filter on sign in the new query either. If the update is stale (e.g. a new version has added rows since you queried Postgres for the version), you still want to return all data from this version rather than just the positively signed data: the latter may be empty, while the former will be all persons from the previous computation, which is much closer. And if it's the latest version, sign has no meaning anyway.
Thanks for the context! When I looked around, all previous versions essentially had the same data, but with a |
ohh good point! I think you should go with The problem with just |
team_id=self.team.pk,
distinct_ids=["2"],
properties={"$some_prop": "something", "$another_prop": "something2"},
)
I would like this test more if it also had a third person where $some_prop = "not something".
],
name="cohort1",
)
cohort1.calculate_people_ch(pending_version=0)
Very optional: to make it even more rock solid, you can replace
cohort1.calculate_people_ch(pending_version=0)
with
cohort1.calculate_people_ch(pending_version=0)
cohort1.calculate_people_ch(pending_version=2)
cohort1.calculate_people_ch(pending_version=4)
to simulate a few recalculations.
WHERE equals(cohortpeople.team_id, 420)
GROUP BY person_id, cohort_id, cohort_people___person_id
HAVING ifNull(greater(sum(cohortpeople.sign), 0), 0)) AS cohort_people LEFT JOIN (
WHERE and(equals(cohortpeople.team_id, 420), false)) AS cohort_people LEFT JOIN (
`false`: I'm assuming that's because wherever this test lives hasn't set up the cohort version?
Precisely. We have never stored any data in the cohort, so no need to query anything. Added an extra test to check it as well.
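As a rough illustration of why the snapshot collapses to `false`, here is a minimal Python sketch of inlining known cohort versions into a filter. The function name `build_cohort_version_filter` and the plain-string SQL output are hypothetical and only for illustration; the actual change builds the HogQL query differently, and this is not its implementation.

```python
from typing import Mapping, Optional


def build_cohort_version_filter(versions: Mapping[int, Optional[int]]) -> str:
    """Build a filter that keeps only rows from each cohort's known version.

    `versions` maps cohort_id -> latest computed version (as it might be read
    from Postgres), or None if the cohort has never been calculated.
    Hypothetical helper, not part of the PostHog codebase.
    """
    clauses = [
        f"(cohort_id = {int(cohort_id)} AND version = {int(version)})"
        for cohort_id, version in sorted(versions.items())
        if version is not None
    ]
    if not clauses:
        # No cohort has stored any data yet, so nothing can match.
        return "false"
    return "(" + " OR ".join(clauses) + ")"


print(build_cohort_version_filter({}))            # no cohorts at all
print(build_cohort_version_filter({7: None}))     # cohort never calculated
print(build_cohort_version_filter({1: 3, 2: 0}))  # two calculated cohorts
```

With no stored versions the filter degenerates to a constant `false`, which is consistent with the `and(equals(cohortpeople.team_id, 420), false)` condition in the snapshot above.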
@@ -111,7 +111,7 @@ export function HogQLDebug({ query, setQuery, queryKey }: HogQLDebugProps): JSX.
<LemonSelect
options={[
{ value: 'auto', label: 'auto' },
{ value: 'leftjoin', label: 'join' },
{ value: 'leftjoin', label: 'leftjoin' },
Flyby fix 🙈
Size Change: 0 B, Total Size: 999 kB
Problem
The following HogQL doesn't always return what you expect:
Behind the scenes it's translated to an operation that groups by `person_id`, `cohort_id` and filters with `having(sum(sign)) > 0`. This works well if the data is correct, but sometimes cohort updates fail, leaving messy data. Here's a customer case where `sum(sign)` equals `-16` or `-9` or `0` over all rows, even though the last `version` has all rows with a `+1` sign.

Changes
In "In Cohort" queries we ignore the `sign` field and just check for the latest `version` field. When selecting from the table itself we used just `sum(sign) > 0`, ignoring the versions.

We'll now do the same as we do with "in cohort" checks: fetch the version information from Postgres and inline it into the query.
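To make the difference concrete, here is a small self-contained Python sketch comparing the two membership rules on made-up rows. The `CohortRow` type, the helper names, and the sample data are all hypothetical and chosen only to mimic a failed update that left extra `-1` rows; they are not taken from the customer case.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class CohortRow(NamedTuple):
    person_id: str
    cohort_id: int
    sign: int
    version: int


Member = Tuple[int, str]  # (cohort_id, person_id)


def members_by_sign(rows: Iterable[CohortRow]) -> Set[Member]:
    """Old rule: group by (cohort_id, person_id) and keep sum(sign) > 0."""
    totals: Dict[Member, int] = defaultdict(int)
    for row in rows:
        totals[(row.cohort_id, row.person_id)] += row.sign
    return {key for key, total in totals.items() if total > 0}


def members_by_version(rows: Iterable[CohortRow], latest: Dict[int, int]) -> Set[Member]:
    """New rule: ignore sign, keep rows whose version matches the known one."""
    return {
        (row.cohort_id, row.person_id)
        for row in rows
        if latest.get(row.cohort_id) == row.version
    }


# Made-up data: a failed update wrote the removal of p1 twice.
rows = [
    CohortRow("p1", 1, +1, 0),
    CohortRow("p1", 1, -1, 1),
    CohortRow("p1", 1, -1, 1),  # duplicate left behind by the failed update
    CohortRow("p1", 1, +1, 2),
    CohortRow("p2", 1, +1, 2),
]
latest_versions = {1: 2}  # assumed to come from Postgres

print(members_by_sign(rows))
print(members_by_version(rows, latest_versions))
```

In this sample, `p1` sums to `0` under the sign rule and drops out of the cohort, while the version rule still returns both `p1` and `p2` because both have a row in version `2`.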
Thus this swaps the query `select cohort_id, person_id from cohort_people` in the following way.

Before:
After:
I tried inlining the versions in ClickHouse with a `(select cohort_id, max(version) as version from raw_cohort_people group by cohort_id)` subquery, but on "team 2", this query alone took ~5 sec. Smaller customer teams ran much faster, but this is obviously too slow for larger users.

Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
WIP